We adapt here the tutorial with demo data (see file RNASeq_in_R_demo.html) to study a ‘real’ scenario.
The code used to generate this document along with the data that you will need to reproduce the results presented here are available from the github repository VoisinneG/RNA_seq_analysis_in_R
Clone this repository from the terminal with the command:
git clone "https://github.com/VoisinneG/RNA_seq_analysis_in_R"
First things first : create a R project to store data, code, output tables and figures in separate sub-folders. Just click on “File > New Project…”
Create a new project. Add directories ./data, ./R, ./output and ./figures. Create a new R script file script.R in the ./R directory. We will use it to copy paste useful command lines below. We can re-run all commands sequentially by clicking on the source button in the top right corner of the R studio editor.
You will have to report your analysis in the same format as this document. This document has been built from a Rmarkdown file using the knitr package. You can download the original .Rmd file here which could provide a useful starting point. You can find more information about Rmarkdown on the official website and in this cheatsheet .
Read count data and sample metadata files are available from the github repository you have just cloned in the data folder. You can copy these file in your ./data folder.
We can now start by loading the data into the RStudio environment.
seqdata <- read.delim("./data/S16082_allresV2_rawdata.csv",
stringsAsFactors = FALSE)
sampleinfo <- read.delim("./data/Sample_Info.csv")
You now have two objects in your environment. You can have a look at them using the head() or View() commands or by double clicking on an object within the environment. The seqdata object contains information about genes (one gene per row), the first column has the Entrez gene id, the second has the gene length and the remaining columns contain information about the number of reads aligning to the gene in each experimental sample.
names(seqdata)
## [1] "Ensembl.gene.id" "X1_28_Q_NT" "X2_57_Q_NT"
## [4] "X4_80_Q_NT" "X5_82_AM_NT" "X6_24_AM_NT"
## [7] "X7_91_AM_NT" "X8_47_Q_TNF" "X9_23_Q_TNF"
## [10] "X10_99_Q_TNF" "X11_100_Q_TNF" "X12_54_AM_TNF"
## [13] "X13_54_AM_TNF" "X15_94_AM_TNF" "X17_3_Q_IS"
## [16] "X18_7_Q_IS" "X19_83_Q_IS" "X20_84_Q_IS"
## [19] "X21_1_Q_5ASA" "X22_1_Q_5ASA" "X23_21_Q_5ASA"
## [22] "X24_22_Q_5ASA" "X26_49_AM_5ASA" "X27_49_AM_5ASA"
## [25] "X28_85_AM_5ASA" "X29_85_AM_5ASA" "X31_20_AM_5ASA"
## [28] "X32_20_AM_5ASA" "X35_68_AM_TNF_RCH" "X36_55_AM_TNF_RCH"
## [31] "X37_55_AM_TNF_RCH" "X38_90_AS_Cort" "X39_93_AS_Cort"
## [34] "X40_44_AS_Cort" "X41_96_AS_Cort" "X42_2"
## [37] "X43_9" "X44_18" "X45_41"
## [40] "X46_50" "Gene_name"
The sampleinfo file contains basic information about the samples that we will need for the analysis today.
sampleinfo
## sampleName Treatment Status Replicate
## 1 X1_28_Q_NT NT quiescent 1
## 2 X2_57_Q_NT NT quiescent 2
## 3 X4_80_Q_NT NT quiescent 4
## 4 X5_82_AM_NT NT active 1
## 5 X6_24_AM_NT NT active 2
## 6 X7_91_AM_NT NT active 3
## 7 X8_47_Q_TNF TNF quiescent 1
## 8 X9_23_Q_TNF TNF quiescent 2
## 9 X10_99_Q_TNF TNF quiescent 3
## 10 X11_100_Q_TNF TNF quiescent 4
## 11 X12_54_AM_TNF TNF active 1
## 12 X13_54_AM_TNF TNF active 2
## 13 X15_94_AM_TNF TNF active 4
## 14 X42_2 NT healthy 1
## 15 X43_9 NT healthy 2
## 16 X44_18 NT healthy 3
## 17 X45_41 NT healthy 4
## 18 X46_50 NT healthy 5
We will be manipulating and reformatting the counts matrix into a suitable format for downstream analysis. The first two columns in the seqdata dataframe contain annotation information. We need to make a new matrix countdata containing only the counts, but we can store the gene identifiers (the EntrezGeneID column) as rownames.
# Edit countdata columns
idx_remove <- which(! names(seqdata) %in% sampleinfo$sampleName)
countdata <- seqdata[,-idx_remove]
# Store Ensembl GeneID as rownames
rownames(countdata) <- seqdata[,1]
Note that the column names are now the same as SampleName in the sampleinfo file. This is good because it means our sample information in sampleinfo is in the same order as the columns in countdata. Let’s also simplify sampleinfo by putting SampleName as row names.
rownames(sampleinfo) <- sampleinfo[,1]
sampleinfo <- sampleinfo[, -1]
table(colnames(countdata)==rownames(sampleinfo))
##
## TRUE
## 18
Genes with very low counts across all libraries provide little evidence for differential expression and they interfere with some of the statistical approximations that are used later in the pipeline. They also add to the multiple testing burden when estimating false discovery rates, reducing power to detect differentially expressed genes. These genes should be filtered out prior to further analysis. Here we perform a minimal pre-filtering to keep only genes that have at least 1 read total.
keep <- rowMeans(countdata) >= 1
countdata <- countdata[keep, ]
If you want to normalize data and perform differential expression analysis, you can jump to section Creating a DESeqDataSet
The only information we have about genes is their ENSEMBL Gene ID, which is not very informative. We would like to add some annotation information. There are a number of ways to do this. We will demonstrate how to do this using the org.Mm.eg.db package.
First we need to decide what information we want. In order to see what we can extract we can run the columns function on the annotation database.
library(org.Hs.eg.db)
columns(org.Hs.eg.db)
## [1] "ACCNUM" "ALIAS" "ENSEMBL" "ENSEMBLPROT" "ENSEMBLTRANS"
## [6] "ENTREZID" "ENZYME" "EVIDENCE" "EVIDENCEALL" "GENENAME"
## [11] "GO" "GOALL" "IPI" "MAP" "OMIM"
## [16] "ONTOLOGY" "ONTOLOGYALL" "PATH" "PFAM" "PMID"
## [21] "PROSITE" "REFSEQ" "SYMBOL" "UCSCKG" "UNIGENE"
## [26] "UNIPROT"
Entries in this database are ENTREZID and not ENSEMBL IDs. Let’s get all information from all entries using the select function. We can now match ENTREZ IDs and ENSEMBL IDs. We will store our annotation information in a separate data frame :
annot <- AnnotationDbi::select(org.Hs.eg.db, keys=keys(org.Hs.eg.db), columns=c("ENTREZID", "ENSEMBL", "SYMBOL"))
idx_match <- match(row.names(countdata), annot$ENSEMBL)
annot <- annot[idx_match, ]
rownames(annot) <- rownames(countdata)
Let’s check that the ENSEMBL column matches the countdata rownames.
table(annot$ENSEMBL==rownames(countdata))
##
## TRUE
## 20271
You might have noticed that some symbols appear as NA. That is because there is not a one-to-one match between EntrezID and Ensembl gene IDs. Gene names are also in seqdata, so we’ll use these instead.
annot$Gene_name <- seqdata$Gene_name[match(rownames(annot), seqdata$Ensembl.gene.id)]
We are working we raw count data in this section. Data normalization and transformation will come later.
To explore a count matrix, it is often instructive to look at it as a heatmap. Below we show how to produce such a heatmap. We focus on the 20 genes with the highest average counts.
library("pheatmap")
select <- order(rowMeans(countdata), decreasing=TRUE)[1:20]
pheatmap(log10(countdata[select,]+1),
cluster_rows=FALSE,
cluster_cols=FALSE)
Look up the documentation for pheatmap in Rstudio. The parameter annotation_col allow us to add annotation columns. We’ll also cluster rows and columns , scale values by rows and change row labels using gene symbol.
pheatmap(log10(countdata[select,]+1),
show_rownames=TRUE,
cluster_cols=TRUE,
annotation_col=sampleinfo[ , c("Treatment","Status")] ,
scale = "row",
labels_row = annot$Gene_name[select]
)
If you like interactive objects, you could try the heatmaply() function to generate the heatmap.
library(heatmaply)
heatmaply(log10(countdata[select,]+1), scale = "row")
Data visualization with ggplot2 works with data frames. Here, we convert our data into a data frame where all count values are stored in the same column named value (long format). This will allow us to fully exploit ggplot2 features.
library(reshape2)
library(dplyr)
df <- countdata
df$GeneID <- rownames(countdata)
df_melt <- melt(df, id.vars = "GeneID")
df_melt <- rename(df_melt, sample = variable)
As an example, we can now plot the distribution of log10+1 transformed count values for each sample (You can offset histograms for better readability using the ggridges package).
plot <- ggplot(df_melt, aes(x=log10(value+1), fill = sample)) +
geom_density(alpha = 0.5)
plot
Box plots allow to see a summary of these distributions.
plot <- ggplot(df_melt, aes(x=sample, y = log10(value+1), fill = sample)) +
theme(axis.text.x = element_text(angle = 90)) +
geom_boxplot(alpha = 0.5)
plot
Some samples seem to differ quite neatly from the other samples. Some normalization between samples will be needed before going further into the analysis.
Here we use the dplyr package to group rows and return group summary (see the corresponding chapter in R for data science ). It makes it easy to compute the median count value per sample.
df_median <-
df_melt %>%
group_by(sample) %>%
summarise(median = median(value, na.rm = TRUE))
df_median
## # A tibble: 18 x 2
## sample median
## <fct> <dbl>
## 1 X1_28_Q_NT 31
## 2 X2_57_Q_NT 33
## 3 X4_80_Q_NT 20
## 4 X5_82_AM_NT 46
## 5 X6_24_AM_NT 39
## 6 X7_91_AM_NT 45
## 7 X8_47_Q_TNF 32
## 8 X9_23_Q_TNF 52
## 9 X10_99_Q_TNF 17
## 10 X11_100_Q_TNF 11
## 11 X12_54_AM_TNF 32
## 12 X13_54_AM_TNF 42
## 13 X15_94_AM_TNF 35
## 14 X42_2 43
## 15 X43_9 50
## 16 X44_18 63
## 17 X45_41 31
## 18 X46_50 54
Build a data frame with the median and mean count per gene across samples.
Note that normalization will also be carried out later during differential expression analysis. Here we show how we can use the dplyr package to normalize the data using the median. In passing, we also log transform the data.
median_mean <- mean(df_median$median)
df_melt <-
df_melt %>%
group_by(sample) %>%
mutate(value_norm = value/median(value, na.rm = TRUE)*median_mean,
value_norm_log = log10(value_norm + 1))
ggplot(df_melt, aes(x=sample, y = value_norm_log, fill = sample)) +
theme(axis.text.x = element_text(angle = 90)) +
geom_boxplot(alpha = 0.5)
We can build a normalized data frame using the reverse transform dcast:
data.norm.log <- dcast(df_melt, GeneID~sample, value.var = "value_norm_log")
rownames(data.norm.log) <- data.norm.log$GeneID
data.norm.log <- data.norm.log[-1]
Things get more interesting when we include metadata. Let’s map sample information to df_melt
idx_match <- match(df_melt$sample, rownames(sampleinfo))
df_melt$Status <- sampleinfo$Status[idx_match]
df_melt$Treatment <- sampleinfo$Treatment[idx_match]
Now we have access to two more variables, Treatment and Status, that we could use to form groups.
ggplot(df_melt, aes(x = Treatment, y = value_norm_log, fill = Treatment)) +
geom_bar(alpha = 0.5, stat = "summary", fun.y = "median", position = "dodge") +
facet_wrap(~Status)
We can also focus on a particular set of GeneIDs and split plots using facet_grid. Let’s add an unpaired t-test on top of that.
library(ggsignif)
idx_select <- df_melt$GeneID %in% df_melt$GeneID[1]
ggplot(df_melt[idx_select, ], aes(x = Treatment, y = value_norm_log, fill = Treatment)) +
geom_bar(alpha = 0.5, stat = "summary", fun.y = "median", position = "dodge") +
geom_point(position = position_jitter(width = 0.25, height = 0))+
scale_y_continuous(expand = expand_scale(add=1)) +
geom_signif(comparisons = list(1:2),
na.rm = TRUE, test = "t.test",
test.args = list("paired" = FALSE),
position = "identity") +
facet_grid(GeneID~Status)
To evaluate the quality of the data, we compare values across samples using the sample correlation matrix.
data.norm.log <- dcast(df_melt, GeneID ~ sample, value.var = "value_norm_log")
rownames(data.norm.log) <- data.norm.log$GeneID
data.norm.log <- data.norm.log[-1]
corr_mat = cor( data.norm.log )
heatmaply(corr_mat)
Let’look more in more details at two samples :
library(ggpointdensity)
ggplot(data.norm.log, aes_string(x = rownames(sampleinfo)[3], y = rownames(sampleinfo)[4])) +
geom_pointdensity(size = 0.3, alpha = 0.1 ) +
scale_color_viridis() +
annotate("segment", x =0, y =0, xend = 7, yend = 7, linetype = "dashed")
Principal component analyisis (PCA) is a great tool to see the overall “shape” of the data. It allows to identify which samples are similar to one another and which are very different. This can enable us to identify groups of samples that are similar and work out which variables make one group different from another. We use the prcomp function to run the PCA. Usually, features (here genes) are columns while observation (here samples) are rows so we’ll transpose the previously median normalized data.norm.log before running the PCA.
pca_res <- prcomp( t(data.norm.log) )
summary(pca_res)
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6
## Standard deviation 26.2502 20.9285 14.68125 13.84242 12.17558 10.74431
## Proportion of Variance 0.2698 0.1715 0.08438 0.07501 0.05804 0.04519
## Cumulative Proportion 0.2698 0.4412 0.52562 0.60063 0.65866 0.70386
## PC7 PC8 PC9 PC10 PC11 PC12 PC13
## Standard deviation 10.34392 9.52615 9.07145 8.71818 8.3053 7.92569 7.79867
## Proportion of Variance 0.04189 0.03553 0.03222 0.02976 0.0270 0.02459 0.02381
## Cumulative Proportion 0.74575 0.78127 0.81349 0.84324 0.8702 0.89484 0.91865
## PC14 PC15 PC16 PC17 PC18
## Standard deviation 7.59092 7.37775 7.12285 6.70914 6.095e-14
## Proportion of Variance 0.02256 0.02131 0.01986 0.01762 0.000e+00
## Cumulative Proportion 0.94121 0.96252 0.98238 1.00000 1.000e+00
You obtain 18 principal components, each one explaining a percentage of the total variation in the dataset (PC1 explains 27%, PC2 17% and so on). The relationship (correlation or anticorrelation, a.k.a loadings) between the initial variables and the principal components is in $rotation. Let’s see which genes have the greatest loadings
max_loading <- apply(pca_res$rotation, 1, max)
pca_max <- apply(pca_res$rotation, 1, which.max)
idx_select<- order(max_loading, decreasing = TRUE)[1:20]
annotation_row <- data.frame(pc = factor(pca_max[idx_select]))
rownames(annotation_row) <- rownames(data.norm.log)[idx_select]
Let’s look at the heatmap for these genes:
pheatmap(data.norm.log[idx_select, ],
show_rownames=TRUE,
annotation_col=sampleinfo[ , c("Treatment","Status")],
annotation_row = annotation_row,
labels_row = annot$Gene_name[ match(rownames(data.norm.log)[idx_select], rownames(annot))]
)
You can also select genes with the highest positive and negative loadings on a given PC:
idx_select<- c(order(pca_res$rotation[, 1], decreasing = TRUE)[1:10],
order(pca_res$rotation[, 1], decreasing = FALSE)[1:10])
annotation_row <- data.frame(pc = c(rep("up", 10), rep("down", 10) ))
rownames(annotation_row) <- rownames(data.norm.log)[idx_select]
pheatmap(data.norm.log[idx_select, ],
show_rownames=TRUE,
annotation_col=sampleinfo[ , c("Treatment","Status")],
annotation_row = annotation_row,
labels_row = annot$Gene_name[ match(rownames(data.norm.log)[idx_select], rownames(annot))]
)
The values of each sample in terms of the principal components is in $x. Check taht the row names pca_res$x are the same as that of sampleinfo:
pca_res$x[ , 1:2]
## PC1 PC2
## X1_28_Q_NT -17.393326 13.2705716
## X2_57_Q_NT -5.804470 24.2066753
## X4_80_Q_NT 13.687838 -3.9781948
## X5_82_AM_NT 28.012442 -1.4174009
## X6_24_AM_NT -17.055372 10.4438811
## X7_91_AM_NT -5.219547 -3.5208188
## X8_47_Q_TNF -6.839841 14.2835688
## X9_23_Q_TNF 12.275765 17.2709704
## X10_99_Q_TNF -30.876275 -37.4686320
## X11_100_Q_TNF -21.090980 -61.7823048
## X12_54_AM_TNF 80.542757 -13.5139218
## X13_54_AM_TNF 22.919972 -6.5191274
## X15_94_AM_TNF 16.617856 0.5096783
## X42_2 -19.107976 8.8780438
## X43_9 -4.084852 10.2593812
## X44_18 -8.399046 9.4286515
## X45_41 -22.425481 17.8864512
## X46_50 -15.759463 1.7625274
We’ll use the ggplot2 package to plot the results of the PCA. See the R for data science book for an introduction to ggplot2. We first need to create a data frame from the pca results.
df_pca <- as.data.frame(pca_res$x)
df_pca$sample <- rownames(df_pca)
names(df_pca)
## [1] "PC1" "PC2" "PC3" "PC4" "PC5" "PC6" "PC7" "PC8"
## [9] "PC9" "PC10" "PC11" "PC12" "PC13" "PC14" "PC15" "PC16"
## [17] "PC17" "PC18" "sample"
We can now use the ggplot() function and choose to draw one point per sample.
library(ggplot2)
library(ggrepel)
pca_plot <- ggplot(df_pca, aes(x=PC1, y=PC2, color = sample, label=sample)) +
geom_point(show.legend = FALSE) +
geom_text_repel(show.legend = FALSE)
pca_plot
As an exercise, check that the row names pca_res$x are the same as that of sampleinfo (check it) and map metadata directly to the df_pca dataframe. Color points according to the sample Status or Treatment.
At some point in the analysis, we might want to perform pairwise comparisons between genes. Let’s first create a data frame (note that we use the transpose of the countdata matrix so available variables are in names(df)).
df <- as.data.frame(t(data.norm.log))
Plotting the data for two different genes is a good way to identify functionnal relationships between them. Let’s pick two genes,
library(ggplot2)
xvar <- as.name(names(df)[1])
yvar <- as.name(names(df)[2])
p <- ggplot(df, aes_string(x=xvar, y=yvar)) + geom_point(alpha = 0.5)
p
and fit a linear model to the data and compute Pearson’s correlation coefficient.
lm_res <- lm(formula =`ENSG00000000005` ~ `ENSG00000000003`, data = df)
sm<-summary.lm(lm_res)
sqrt(sm$r.squared)
## [1] 0.858598
Now we want to repeat this for all pairs of genes (we use only a subset of genes) and build a correlation matrix
library(Hmisc)
idx_select<- order(max_loading, decreasing = TRUE)[1:20]
corr <- Hmisc::rcorr(t(data.norm.log[idx_select, ]))
pheatmap(corr$r, fontsize = 4)
We can explore correlations by constructing a data frame with the correlation corefficient, its associated p-value and the number of points used to compute the correlation
library(reshape2)
df_R_coeff <- melt(corr$r)
df_R_coeff$variable <- "R"
df_pval <- melt(corr$P)
df_pval$variable <- "P"
df_nval <- melt(corr$n)
df_nval$variable <- "n"
df_corr <- rbind(df_R_coeff, df_pval, df_nval)
df_corr <- dcast(df_corr, Var1 + Var2 ~ variable, mean)
df_corr$gene_name_1 <- annot$Gene_name[match(df_corr$Var1, rownames(annot))]
df_corr$gene_name_2 <- annot$Gene_name[match(df_corr$Var2, rownames(annot))]
df_corr <- df_corr[df_corr$Var1 != df_corr$Var2, ]
We can look at the whole correlation data frame using the interactive function datatable. It is convenient if you wish to search for certain variables or order results.
library(DT)
DT::datatable(df_corr)
Now we can filter and keep only the most correlated pairs
df_corr_filtered <- df_corr[!is.na(df_corr$P) & df_corr$n>10 & df_corr$P<0.05 & df_corr$R>0.9, ]
We use that restricted set to build a network.
library(igraph)
net <- igraph::graph.data.frame(df_corr_filtered[ , c("gene_name_1", "gene_name_2")], directed=FALSE)
net <- igraph::simplify(net)
clusters <- igraph::cluster_fast_greedy(as.undirected(net))
group = clusters$membership
names(group) <- clusters$names
And we render it in an interactive environment
library(networkD3)
net_d3 <- networkD3::igraph_to_networkD3(net, group = group)
p <- forceNetwork(Links = net_d3$links, Nodes = net_d3$nodes,
Source = 'source', Target = 'target',
fontFamily = "arial",
NodeID = 'name', Group = 'group',
colourScale = JS("d3.scaleOrdinal(d3.schemeCategory20);"),
charge = -10, opacity = 1,
linkColour = rgb(0.75, 0.75, 0.75),
fontSize = 12, bounded = TRUE, zoom=TRUE, opacityNoHover = 1)
p
Alternatively, we can save it as a text file and use another software such as cytoscape to manipulate this network.
dir.create("./output/")
write.table(df_corr_filtered, file = "./output/df_corr_filtered.txt", sep = "\t", row.names = FALSE, quote = FALSE)
write.table(data.frame(clusters$membership, clusters$names), file = "./output/group.txt", sep = "\t", row.names = FALSE, quote = FALSE)
Exercise: add an extra column df$score <- 1:length(rownames(df)) that contains a score associated to each sample to our data frame. Which genes most strongly correlate with this score?
DESeqDataSetWe will use the DESeq2 package to conduct the core steps of the analysis. It should be already installed so we just need to load it in the RStudio session.
library(DESeq2)
We can use the help("DESeq2-package") command to browse the package documentation in Rstudio. We will use the DESeqDataSetFromMatrix function to build a DESeqDataSet object.
dds <- DESeqDataSetFromMatrix(countData = countdata,
colData = sampleinfo,
design = ~ Treatment + Status)
The design indicates how to model the samples, i.e how the counts for each gene depend on the variables in colData = sampleinfo. Here we want to measure the effect of the cell type and of the pregnancy status of the mice. Note that the two factor variables Treatment and Status should be columns of sampleinfo. Now the dds object contains count data along with the metadata and the experiment design. The count data is obtained using counts(dds) and the metadata is obtained using colData(dds).
Now that we have a DESeqDataSet object, we can analyse the data using the many tools available in the DESeq2 package.
The standard differential expression analysis steps are wrapped into a single function, DESeq. It is then straightforward to perform differential expression analysis.
dds <- DESeq(dds)
We can see the comparisons carried out by DESeq using resultsNames().
resultsNames(dds)
## [1] "Intercept" "Treatment_TNF_vs_NT"
## [3] "Status_healthy_vs_active" "Status_quiescent_vs_active"
Results tables are generated using the function results, which extracts a results table with log2 fold changes, p values and adjusted p values. We specify which coomparison we want to extract using the name parameter. Here we choose to compare basal and luminal cell types. We select significant results with adjusted p-values (correction for multiple tests computed with the Benjamini–Hochberg procedure) below 0.01 using alpha.
res <- results(dds, alpha = 0.05, pAdjustMethod="BH", name="Status_quiescent_vs_active")
head(res)
## log2 fold change (MLE): Status quiescent vs active
## Wald test p-value: Status quiescent vs active
## DataFrame with 6 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## ENSG00000129451 26.1748456742507 -0.59412036935381 0.2587337780261
## ENSG00000205420 0.982580594914044 -2.93453103817323 2.10952255838686
## ENSG00000169035 7.80444717174889 -0.888004457168178 0.488624734035383
## ENSG00000088340 685.378954759117 -1.04974882266852 0.330274957645292
## ENSG00000109511 5.28342772962168 -2.44407244320362 1.39680814205695
## ENSG00000071242 443.601832207341 -0.365417424016576 0.187158049050984
## stat pvalue padj
## <numeric> <numeric> <numeric>
## ENSG00000129451 -2.29626133041615 0.0216609437059134 0.237524902153958
## ENSG00000205420 -1.39108777315814 0.164198812074885 NA
## ENSG00000169035 -1.81735470047629 0.0691628192437002 0.420362197594208
## ENSG00000088340 -3.17840877235365 0.0014808581113695 0.0483347219798127
## ENSG00000109511 -1.74975529538685 NA NA
## ENSG00000071242 -1.95245369285204 0.0508843636655219 0.365056173443204
We can get a summary of these results using summary()
summary(res)
##
## out of 31010 with nonzero total read count
## adjusted p-value < 0.05
## LFC > 0 (up) : 118, 0.38%
## LFC < 0 (down) : 633, 2%
## outliers [1] : 295, 0.95%
## low counts [2] : 7162, 23%
## (mean count < 5)
## [1] see 'cooksCutoff' argument of ?results
## [2] see 'independentFiltering' argument of ?results
Let’s select genes have an a adjusted p-value lower than 0.05 and a fold-change greater than 2^2:
enrich <- data.frame(ID=rownames(res), res@listData)
enrich <- enrich %>%
filter(padj < 0.05, log2FoldChange > 2) %>%
arrange(desc(log2FoldChange))
enrich %>% head(10)
## ID baseMean log2FoldChange lfcSE stat pvalue
## 1 ENSG00000260075 14.495252 7.445833 1.699484 4.381232 1.180101e-05
## 2 ENSG00000156284 787.177133 5.468534 0.873294 6.261962 3.801635e-10
## 3 ENSG00000169271 30.965038 4.806102 1.382521 3.476333 5.083214e-04
## 4 ENSG00000167759 5.326531 4.628798 1.024176 4.519533 6.197609e-06
## 5 ENSG00000277010 5.556916 4.551041 1.200091 3.792246 1.492906e-04
## 6 ENSG00000091583 5.308916 4.535677 1.131813 4.007444 6.137949e-05
## 7 ENSG00000179796 5.892397 4.471098 1.124978 3.974386 7.056097e-05
## 8 ENSG00000178084 4.936564 4.370418 1.071060 4.080460 4.494669e-05
## 9 ENSG00000257612 9.577385 4.103931 1.249055 3.285629 1.017551e-03
## 10 ENSG00000080224 7.100705 4.057451 1.038739 3.906131 9.378583e-05
## padj
## 1 1.561512e-03
## 2 4.712627e-07
## 3 2.380217e-02
## 4 9.930087e-04
## 5 9.686619e-03
## 6 5.200255e-03
## 7 5.730767e-03
## 8 4.174302e-03
## 9 3.798156e-02
## 10 6.946345e-03
Have a look at the most enriched gene
#get normalized data
d <- plotCounts(dds, gene=as.character(enrich$ID[1]), intgroup=c("Treatment","Status"),
returnData=TRUE)
library("ggplot2")
p <- ggplot(d, aes(x=Status, y=log10(count+1))) +
geom_point(position=position_jitter(w=0.1,h=0)) +
facet_wrap(~Treatment)
p
We can also facet the plot according to Status. It could be interesting to see how the Treatment effect depends on Status. See section Interactions to address such questions.
ggplot(d, aes(x=Treatment, y=log10(count+1))) +
geom_point(position=position_jitter(w=0.1,h=0)) +
facet_wrap(~Status)
We have used default parameters here. Data normalization and model fitting have been performed in the background by DESeq. If you want to perform these steps separately, have a look at the [Additional Material] section.
Interaction terms allow to test, for example, if the log2 fold change attributable to a given condition is different based on another factor, for example if the Treatment effect differs across Status.
Interaction terms can be added to the design formula but a simpler approach to take into account interactions consists in: -combining the factors of interest into a single factor with all combinations of the original factors -changing the design to include just this factor, e.g. ~ group
dds.int <- dds
dds.int$group <- factor(paste(dds.int$Status, dds.int$Treatment, sep="."))
design(dds.int) <- ~ group
dds.int <- DESeq(dds.int)
resultsNames(dds.int)
## [1] "Intercept" "group_active.TNF_vs_active.NT"
## [3] "group_healthy.NT_vs_active.NT" "group_quiescent.NT_vs_active.NT"
## [5] "group_quiescent.TNF_vs_active.NT"
Now we can compare active and quiescent patients that received the TNF treatment.
#res.int <- results(dds.int, contrast = c("group", "active.TNF", "quiescent.TNF"))
res.int <- results(dds.int, contrast = c("group", "active.NT", "quiescent.NT"))
res.int
## log2 fold change (MLE): group active.NT vs quiescent.NT
## Wald test p-value: group active.NT vs quiescent.NT
## DataFrame with 31010 rows and 6 columns
## baseMean log2FoldChange lfcSE
## <numeric> <numeric> <numeric>
## ENSG00000129451 26.1748456742507 0.937847364045463 0.37466711622884
## ENSG00000205420 0.982580594914044 4.47830019282281 5.14574796546978
## ENSG00000169035 7.80444717174889 0.546284275261462 0.740871487912225
## ENSG00000088340 685.378954759117 0.673352579982499 0.480607726904312
## ENSG00000109511 5.28342772962168 -2.37222965387234 2.10174490238311
## ... ... ... ...
## ENSG00000283557 1.36687788738831 0.699331388957992 1.56315664140943
## ENSG00000283580 2.75420568937431 0.574170190826705 1.10645358085756
## ENSG00000283597 2.08568607229969 0.688426276392395 1.47378995984917
## ENSG00000283638 1.17573897595086 1.91454074538724 2.28486824420557
## ENSG00000283667 2.41739220935345 1.5272315290754 1.17506471764344
## stat pvalue padj
## <numeric> <numeric> <numeric>
## ENSG00000129451 2.50314832399821 0.012309394506137 0.996750113873654
## ENSG00000205420 0.870291398427237 0.384141178398879 NA
## ENSG00000169035 0.737353622287302 0.460907331057749 NA
## ENSG00000088340 1.4010440163326 0.161200911051718 0.996750113873654
## ENSG00000109511 -1.12869532890625 NA NA
## ... ... ... ...
## ENSG00000283557 0.447384075550762 0.654597771648423 NA
## ENSG00000283580 0.51892840401103 0.603810669441014 NA
## ENSG00000283597 0.467112882532358 0.640419117769962 NA
## ENSG00000283638 0.837921727102873 0.40207466528852 NA
## ENSG00000283667 1.29969992813521 0.193703835017983 NA
Alternatively, you can create another object keeping only the samples you want to compare :
samples <- rownames(sampleinfo)[sampleinfo$Status %in% c("quiescent", "active") &
sampleinfo$Treatment == "NT"]
dds.focus <- DESeqDataSetFromMatrix(countData = countdata[,samples],
colData = sampleinfo[rownames(sampleinfo) %in% samples, ],
design = ~ Status)
dds.focus <- DESeq(dds.focus)
resultsNames(dds.focus)
## [1] "Intercept" "Status_quiescent_vs_active"
In DESeq2, the function plotMA shows the log2 fold changes attributable to a given variable over the mean of normalized counts for all the samples in the DESeqDataSet.
plotMA(res.int, ylim=c(-5, 5))
Volcano plots allow to visualize p-values and fold-change. We add gene names for a set of selected genes
df <- as.data.frame(res.int@listData)
df$ID <- res.int@rownames
df$names <- annot$Gene_name[match(df$ID, rownames(annot))]
select <- abs(df$log2FoldChange) > 2 & df$padj < 0.05
ggplot(df, aes(x=log2FoldChange, y=-log10(padj), label = names)) +
geom_point(color = "gray") +
geom_point(data = df[select, ], color = "red") +
geom_text_repel(data = df[select, ], color = "black",
min.segment.length = 0, segment.colour = "gray")
Now that we have identified a set of genes that are differentially expressed between basal and luminal conditions, we can use available gene annotations to figure out whether this set of genes is enriched in terms corresponding to specific biological processes, molecular functions or pathways. We’ll use the clusterProfiler package to perform this analysis on GO annotation terms corresponding to molecular functions (GO MF ontology).
library(clusterProfiler)
df <- as.data.frame(res.int@listData)
df$ID <- res.int@rownames
df$names <- annot$Gene_name[match(df$ID, rownames(annot))]
select <- abs(df$log2FoldChange) > 2 & df$padj < 0.05
enrichgo_result_over_MF = enrichGO( gene = df$ID[select],
keyType = "ENSEMBL",
universe = df$ID,
OrgDb = org.Hs.eg.db,
ont = "MF",
pvalueCutoff = 0.05,
qvalueCutoff = 0.2,
readable = TRUE)
Results are stored in the result slot
datatable( enrichgo_result_over_MF@result )
Check the clusterProfiler documentation if you wish to look at other ontologies.
For instance, for KEGG pathways :
df$EntrezID <- annot$ENTREZID[match(df$ID, rownames(annot))]
enrichgo_result_over_KEGG = enrichKEGG( gene = df$EntrezID[select], organism = "hsa",
keyType = "kegg",
universe = df$EntrezID,
pvalueCutoff = 0.05,
qvalueCutoff = 0.2)
Complementary to the previous approach, GSEA (Gene Set Enrichment Analysis) consists in analyzing how sets of genes of interest (also called “genesets” or “signatures”) are distributed along some ranking (i.e. of ratios) between two conditions (i.e. luminal versus basal). It calculates an enrichment score and some statistics (p-value and FDR) for each geneset. Download the gene set file from here. Several sets can be dowloaded. We choose the Hallmark gene sets with “gene symbols” and save it in ./data. You should now have a file h.all.v7.1.symbols.gmt in the ./data folder. Let’s read it:
#read the geneset file
pathwaysH <- read.csv("./data/h.all.v7.1.symbols.gmt", sep="\t", header=F, stringsAsFactor=FALSE)
The gene set file needs to be in a specific format so we need to do a little formatting before using the fgsea() function:
#read the geneset file
# the format is different than when we do: load("human_H_v5p2.rdata")
# this causes a problem.
genesets <- lapply( 1:nrow(pathwaysH), function(x){
row <- pathwaysH[x, ]
current_string = paste( unlist( row[ 3:length( row)], use.names = FALSE), collapse="\t")
current_string <- gsub("[\t]+$","", current_string)
current_string <- strsplit(current_string, split="\t")[[1]]
return(current_string)}
)
names(genesets) <- pathwaysH[[1]]
We also need to rank genes. We choose to rank them according to the log2 fold-change stored in our res object. But before that, we need to remove duplicate genes in order to keep only distinct gene symbols. We decide to keep the most regulated ones.
ranks <- data.frame(ID = rownames(res.int), log2FC = res.int$log2FoldChange)
idx_match <- match(rownames(res.int), rownames(annot))
ranks$symbol <- toupper(annot$Gene_name[idx_match])
ranks <- ranks[order(abs(ranks$log2FC), decreasing = TRUE), ]
ranks <- distinct(ranks, symbol, .keep_all = TRUE)
ranks <- ranks[order(ranks$log2FC, decreasing = TRUE), ]
stats <- ranks$log2FC
names(stats) <- ranks$symbol
Run GSEA
library(fgsea)
fgseaRes <- fgsea(pathways = genesets, stats = stats, minSize=15, maxSize = 500, nperm=1000)
Lets look at the top 10 most enriched gene sets:
topUp <- fgseaRes %>%
filter(NES > 0) %>%
arrange(padj, desc(NES)) %>%
head(10)
topUp %>% dplyr::select(pathway, padj, NES)
## pathway padj NES
## 1: HALLMARK_TNFA_SIGNALING_VIA_NFKB 0.006038647 2.372715
## 2: HALLMARK_INFLAMMATORY_RESPONSE 0.006038647 2.217850
## 3: HALLMARK_INTERFERON_GAMMA_RESPONSE 0.006038647 1.875773
## 4: HALLMARK_IL6_JAK_STAT3_SIGNALING 0.006038647 1.865440
## 5: HALLMARK_COAGULATION 0.006038647 1.826556
## 6: HALLMARK_ALLOGRAFT_REJECTION 0.006038647 1.815223
## 7: HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION 0.006038647 1.803157
## 8: HALLMARK_COMPLEMENT 0.006038647 1.761225
## 9: HALLMARK_IL2_STAT5_SIGNALING 0.006038647 1.599081
## 10: HALLMARK_KRAS_SIGNALING_UP 0.006038647 1.575653
and at the top 10 most depleted gene sets :
topDown <- fgseaRes %>%
filter(NES < 0) %>%
arrange(padj, desc(NES)) %>%
head(10)
topDown %>% dplyr::select(pathway, padj, NES)
## pathway padj NES
## 1: HALLMARK_MYC_TARGETS_V2 0.01785714 -1.6116186
## 2: HALLMARK_PANCREAS_BETA_CELLS 0.03424658 -1.6672755
## 3: HALLMARK_G2M_CHECKPOINT 0.03424658 -1.9768271
## 4: HALLMARK_E2F_TARGETS 0.03424658 -2.5547823
## 5: HALLMARK_OXIDATIVE_PHOSPHORYLATION 0.03424658 -2.6196012
## 6: HALLMARK_MYC_TARGETS_V1 0.03424658 -2.7515246
## 7: HALLMARK_FATTY_ACID_METABOLISM 0.11595547 -1.2391551
## 8: HALLMARK_DNA_REPAIR 0.28282828 -1.1641585
## 9: HALLMARK_BILE_ACID_METABOLISM 0.47892720 -1.0781748
## 10: HALLMARK_PROTEIN_SECRETION 0.71674832 -0.9906032
Let’see how the genes within a pathway are distributed along the ranked genes:
plotEnrichment(genesets[[topUp$pathway[1]]], stats)
We can also look at several pathways at once:
topPathways <- bind_rows(topUp, topDown) %>%
arrange(-ES)
plotGseaTable(genesets[topPathways$pathway],
stats,
fgseaRes,
gseaParam = 0.5)
Re-run the analysis using the GO MF gene set. What differentiate GSEA from our previous annotation enrichment analysis?